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1. Introduction 


Air quality in the Lombardy region (northern Italy) is affected by high concentrations of pol- 
lutants. One of the reasons is that Lombardy is localised in the Po Valley where air circulation is 
very weak because of the mountains surrounding the area. The peculiar weather conditions, the 
industrial development and the population density make Lombardy one of the worst European 
region in terms of air quality!. As a result, epidemiological studies have found that Lombardy 
is characterized by an elevated mortality rate related to fine particulate matter (PM>5) expo- 
sure. It is well known that a considerable part (from 10% up to 50%) of the PM; 5 is formed by 
the chemical reactions of the ammonia (NH3) with other precursors. In the Lombardy region, 
97% of the total NH; gaseous emissions are linked to the agriculture sector (INEMAR - ARPA 
Lombardia, 2022). Considering that Lombardy is the leading region in Italy for agriculture pro- 
duction, with the highest regional density of swine and bovines, it is clear that the agriculture 
section has a considerable impact on air quality. 

The project Agriculture Impact On Italian Air Quality, hereafter AgrimOnlIA, aims to es- 
timate the local impact of ammonia emissions on particulate matters (PMio and PM. 5). This 
information can be crucial for the policy-makers who have to prioritise interventions. AgrI- 
mOnlA is an ongoing research project, promoted and financed by Fondazione Cariplo within 
the framework of Data Science for science and society. More information on the project are 
available on https: //agrimonia.net/. 

In this work, we present preliminary results providing continuous spatial maps of PM; 5 con- 
centrations (with daily temporal scale) in the Lombardy region using the AgrimOnIA dataset’, 
which contains harmonised data on meteorology, emissions and land use. We implement three 
spatial prediction methods whose performance will be compared by using standard indexes 
computed with the Leave-One-Out Cross-Validation strategy (LOOCV). In particular, we con- 
sider a spatio-temporal Kriging model with external drift, and two random forest algorithms 
which include spatial and temporal components. 


2. Data 


The AgrimOnlIA dataset is an open access data set containing Air Quality (AQ), Weather 
(WE), Land cover (LA), Emission (EM) and Livestock (LI) data with daily temporal resolution. 
The data are available for all the air quality monitoring stations after a pre-processing step to 
change the support of spatial data from area to point, when necessary. We consider the period 
from 2017 to 2020. The area covered by the AgrImOnIA dataset includes the Lombardy region 


‘https: //www.eea.europa.eu//publications/air-—quality-in-europe-2021 
?The AgrimOnlA dataset will soon be available on Zenodo, which is an open repository operated by CERN 
(https://zenodo.org/). 
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Figure 1: Area of interest with PM3 5 monitoring stations, classified by type: rural background (RB); 
rural industrial (RI); suburban background (SB); suburban traffic (ST); urban background (UB); urban 
industrial (UD; urban traffic (UT). 


and a neighbouring area defined by applying a 0.3-degree buffer to the regional boundaries. 
The area and the PM2.5 monitoring network considered by the Agr/mOnIA project can be seen 
in Figure 1. 
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Figure 3: Time series of PM3 5 concentrations (top) and NH3 agriculture emissions (bottom) for all the 
monitoring sites 


From the AQ available variables, we selected PM; 5 (AQ_pm25, in ug/ m?) as the response 
variable in logarithmic scale. Figure 2 shows the annual mean of PM, 5 concentrations in each 
monitoring station: higher values are located in the lower Po Valley, particularly in the provinces 
of Milan, Cremona, Lodi and Brescia. The other selected variables are described in Table 1. 
The overall NH3 emissions from the agriculture sector (nh3_agr) are calculated by summing 
up NH3 emissions from manure management, agriculture soil and agriculture waste burning 
(EM_nh3_livestock_mm + EM_nh3_agr_soils + EM_nh3_waste_burn). To generate 
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continuous maps as output, the covariates are also obtained on a regular grid 0.1° x 0.1°. Figure 
3 displays the PM2 5 and NH3 daily time series for all the monitoring stations. As it can be seen, 
PM; 5 concentrations follow a seasonality with peaks during the winter, while ammonia shows 
higher values in summer, likely because of a higher uses of fertilisers. 


Table 1: Description of the selected variables 


Variable description Variable name Unit Source 
Longitude of the monitoring stations Longitude degree AgrImOnIA dataset 
Latitude of the monitoring stations Latitude degree AgrImOnIA dataset 
Date Time Date AgrImOnIA dataset 
Altitude of the monitoring stations Altitude m AgrImOnIA dataset 
PM> 5 concentrations AQ_pm25 ug/ m3 AgrImOnlIA dataset 
Temperature of air at 2 m WE_2m_temperature °C AgrImOnIA dataset 
Mean horizontal wind speed at 10 m WE_wind_speed_10m_mean m/s AgrImOnIA dataset 
The accumulated water fallen WE_tot_precipitation m AgrImOnIA dataset 
The pressure of the atmosphere WE_surface_pressure Pa AgrImOnIA dataset 
Net solar radiation WE_solar_radiation Jim AgrImOnIA dataset 
High vegetation index WE_hvi m? /m? AgrImOnIA dataset 
Low vegetation index WE_lvi m? /m? AgrImOnIA dataset 
Mean horizontal wind speed at 100 m WE_wind_speed_100m_mean m/s AgrImOnIA dataset 
Maximum of boundary layer height WE_blh_layer_max m AgrImOnIA dataset 
Minimum of boundary layer height WE_blh_layer_min m AgrImOnIA dataset 
Maximum of relative humidity WE_rh_max % AgrImOnIA dataset 
NH3 emissions - manure management EM_nh3_livestock_mm mg/m? AgrImOnIA dataset 
NH3 emissions - agriculture soil EM_nh3_agr-_soils mg/m? AgrImOnIA dataset 
NH3 emissions - agriculture waste burning EM_nh3_waste_burn mg/m? AgrImOnIA dataset 
NH3 total emissions EM_nh3_sum mg/m? AgrImOnIA dataset 
SO2 total emissions EM_so2_sum mg/m? AgrImOnIA dataset 
NOx total emissions EM_nox_sum mg/m? AgrImOnIA dataset 
NH3 emissions - agriculture sector nh3_agr mg/m? Own elaboration 
Day of the week day_week Categorical Own elaboration 
Type of season season Categorical Own elaboration 
Type of station type-_station Categorical Own elaboration 


(see Figure 1) 


3. Spatial prediction techniques 


In order to perform spatial prediction and to produce continuous spatial maps of PM; 5 
concentrations, we consider two approaches: 1) spatio-temporal kriging with external drift 
(STKED), a classical approach in geostatistics framework; 2) a well-known machine learn- 
ing method - random forest (RF) - extended to the case of data correlated in space and time. 


Spatio-temporal kriging with external drift 
Spatio-temporal kriging is a supervised parametric model which assumes that the observed 


PM,,; data are generated by a given stochastic spatio-temporal model. In particular, we sup- 
pose that the response variable log(AQ_pm25) is Normally distributed with a mean changing in 
space and time and a variance given by the measurement error variance (i.e. the nugget). The 
mean of the response field is in turn defined as the sum of a large-scale trend (or external drift, 
which includes the linear effect of the covariates described in Table 1), and a residual spatio- 
temporal process with separable space-time covariance function. For the implementation of this 
method we use the gst at R-package (Griler et al., 2016), which requires the estimation of a 
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spatio-temporal variogram. Once all the models parameters are estimated, spatial prediction of 
the expected log(AQ_pm25) value at any location in Lombardy region is straightforward: it is 
given basically by a local weighted mean of the spatio-temporal residuals plus the large scale 
component (evaluated using the covariate values at the new sites). 


Random forest for spatio-temporal data 

Random forest is a data-driven non-parametric machine learning technique given by an en- 
semble of regression trees fitted using several bootstrapped version of the original data and 
subsets of the considered covariates. The main limitation of random forest is that it is not able 
to take directly into account the temporal and spatial correlation, as kriging does. In order to in- 
clude in the fitting algorithm some information about the data spatial structure, we propose here 
two different implementations of the method: 1) the standard RF algorithm (RFbase) which 
includes in the covariate set, besides the variables described in Section 2. also the spatial coordi- 
nates (longitude and latitude) of each observation; 2) the spatial RF (RFsp) method proposed by 
Heng] et al. (2018). This method expands the set of covariates by including the buffer distances 
from the observation sites (i.e., if we have n monitoring stations we will have n additional 
columns in the training set each referring to a given station and including the distances from the 
remaining locations). To take into account the temporal component we consider as covariates 
the date of the day, the day of the week and the type of season. The two RF algorithms are im- 
plemented using the Ranger R-package. Prediction in a new spatial location is usually given 
by the averages of the, say, B predictions computed using the single trees in the forest. Indeed, 
we consider an ordinary spatio-temporal kriging model for the differences between observed 
and predicted data in order to include a term taking into account spatio-temporal correlation 
and predict a term for the small scale component. 


4. Preliminary results 


Starting considering the STKED technique, we subset covariates through stepwise strategy 
and we estimate the coefficients shown in Table 2 also referred to interaction terms between 
season and emissions. We can see that, among the emissions, nh3_agr has a larger impact 
on log(AQ_pm25) during the winter, while EM_nox_sum shows larger effects in the remaining 
seasons; this is consistent with the results in Thunis et al. (2021). The sample variogram of the 
residuals of the large-scale component is used to choose the exponential variogram model. 

Figure 4 (right) shows the 2020 mean of the daily predicted PM» 5 concentrations in the area 
of interest. It is worth to note that higher PM; 5 concentrations are predicted where we observe 
higher NH3 emissions from the agriculture sector, as shown in Figure 4 (left). 

As for the ML approach, the variable importance analysis returns similar results for RFbase 
and RFsp. Figure 5 shows the weather components as the most important, in accordance with 
the literature (Cameletti et al., 2011) together with the temporal components. Moreover, it turns 
out that the euclidean distances between sites (dist_from_) is not very important. 

The comparison between the prediction capability of the three models is performed through 
LOOCYV and the results are shown in Table 3. STKED shows higher performance compared to 
the two versions of RF, although it is worth to note that these results are based on a preliminary 
version of the AgrimOnlIA dataset. 


5. Discussion and further development 


Further developments of this work will consider the forthcoming versions of the AgrimOnIA 
dataset and extensions of the considered techniques, always in the framework of the comparison 
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Table 2: Large-scale component coefficient estimates of STKED 


Dependent variable: log(AQ_pm25) 


Covariate Estimate SE 
Altitude —0.152*** 0.003 
Latitude —0.032*** 0.003 
Longitude —0.022*** 0.003 
WE_sum_total precipitation —0.119*** 0.003 
WE_mean_wind_speed_10m —0.142*** 0.003 
WE_min_boundary_layer_height —0.082*** 0.003 
seasonautumm —0.149*** 0.008 
seasonspring —0.411*** 0.008 
seasonwinter 0.600*** 0.010 
EM_so2_sum —0.038*** 0.007 
EM_nox_sum 0.167*** 0.007 
nh3_agr 0.111*** 0.008 
seasonautumm:EM_so2_sum —0.017* 0.010 
seasonspring:EM_so2_sum 0.003 0.010 
seasonwinter:EM_so2_sum 0.022** 0.009 
seasonautumm: EM_nox_sum 0.027** 0.013 
seasonspring:EM_nox_sum —0.116*** 0.016 
seasonwinter:EMnox_sum —0.077*** 0.008 
seasonautumm:nh3_agr —0.130*** 0.009 
seasonspring:nh3_agr —0.078*** 0.009 
seasonwinter:nh3_agr 0.195*** 0.016 
Constant 2.774*** 0.005 
Observations 70,119 

R? 0.325 

Adjusted R? 0.325 

Residual Std. Error 0.664 (df = 70097) 

F Statistic 1,607.042*** (df = 21; 70097) 

Note: *p<0.1; **p<0.05; ***p<0.01 


Figure 4: Mean of daily NH3 emissions from agriculture over the period 2017-2020 (left); mean of 
daily PM2.5 concentrations predicted by STKED for 2020 (right). 


Table 3: LOOCV comparison of the models described in Section 3. 


MAE RMSE BIAS COR 

STKED 0.316 0.510 0.021 0.782 
RFbase 0.352 0.543 -0.022 0.741 
RFsp 0.361 0.547 -0.041 0.738 
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Figure 5: Variable importance of the spatial random forest (RFsp) measured by the RSS mean reduction 
for each variable. 


between classical geostatistics and machine learning approaches. 

Recent studies found that NH3 emissions reductions are the most cost-effective way to re- 
duce PM, concentrations (Gu et al. 2021). Scenario analysis based on spatial prediction tech- 
niques, in compliance with the Regional Plane for emissions reductions?, will allow to assess 
the expected impact of NH3 emissions reduction’s policies before their implementation. 
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